Predicting Stock Market Volatility with Machine Learning
Spring 2025 Capstone Project
Author
Kevin Izadi
Financial market data visualization with candlestick charts and trading indicators. Credit: Nicholas Cappello on Unsplash
Project Overview
This capstone project explores machine learning approaches to predict stock market volatility, focusing specifically on the SPY ETF that tracks the S&P 500 index. By leveraging historical price data and technical indicators, I’ve developed models that forecast future volatility with measurable accuracy.
In today’s unpredictable financial world, anticipating market volatility can provide significant strategic advantages for traders, portfolio managers, and risk analysts. My project began with a fundamental question: can historical patterns reliably predict future market turbulence?
Background
Understanding and forecasting volatility presents a unique challenge due to the complex, non-linear nature of financial markets. Unlike price prediction, volatility forecasting focuses on the magnitude of price movements rather than their direction.
Traditional finance theory suggests markets should be efficient and largely unpredictable. However, decades of research have revealed persistent patterns in volatility behavior, particularly the tendency of volatility to cluster – periods of high turbulence often follow other volatile periods, while calm markets frequently remain stable for extended intervals.
I chose to focus on the SPY ETF because it represents the broad U.S. market, offering high liquidity, extensive historical data, and significance for portfolio management and derivatives pricing. As the world’s most heavily traded ETF, it provides an ideal testing ground for volatility prediction models that might later be extended to other securities.
Problem Statement
This study investigates whether advanced machine learning techniques can identify patterns in historical market data to forecast future volatility more accurately than traditional statistical methods. The central research questions are:
Can neural networks effectively capture the non-linear dynamics of market volatility?
Which technical and fundamental features provide the most predictive value?
How does model performance vary across different market regimes?
What practical applications emerge from improved volatility forecasts?
The research has implications for options pricing, risk management, portfolio construction, and trading strategy development. Accurate volatility forecasts could help investors better time their hedging activities, optimize portfolio allocations, and potentially develop trading strategies that capitalize on expected changes in market turbulence.
Note
Throughout this project, I maintained strict separation between training and testing data to prevent look-ahead bias – a critical consideration in financial modeling.
Data Collection and Preparation
This project utilizes comprehensive historical data for the SPY ETF spanning from 2010 to 2023, collected via the yfinance API. The dataset encompasses daily price information (Open, High, Low, Close, Volume) along with VIX index data to capture market volatility sentiment.
The extended historical timeframe was chosen deliberately to expose the models to diverse market conditions. The dataset includes the post-financial crisis recovery, the bull market of the 2010s, the COVID-19 crash and subsequent recovery, and various periods of both extreme calm and heightened turbulence. This diversity helps ensure that any patterns identified by the models are robust across different market environments.
Data Collection Process
The data collection process involved several components. First, I retrieved historical SPY data using the yfinance API, ensuring coverage from 2010 through the end of 2023. This provided the core price and volume metrics needed for analysis. The data was supplemented with VIX index values, which serve as a widely recognized measure of expected market volatility.
Technical indicators were calculated using established financial analysis libraries. These included various moving averages that help identify trends, momentum indicators like RSI and MACD that capture overbought or oversold conditions, and volume metrics that provide insights into the conviction behind price movements.
The VIX index provides particularly valuable information as it represents the market’s expectation of 30-day forward-looking volatility. By combining actual historical price movements with this forward-looking sentiment measure, the models gain a more comprehensive view of market dynamics.
Dataset Size and Missing Data Handling
The final dataset contained approximately 3,500 trading days spanning 14 years (2010-2023). This timeframe was chosen to capture multiple market cycles and volatility regimes, providing sufficient data for both training and out-of-sample testing. Market holidays, weekends, and other non-trading days were naturally excluded from the dataset as they weren’t present in the yfinance API data.
Missing values were relatively rare in the SPY price data (less than 0.3% of observations), but more common in some derived indicators and the VIX data (approximately 1.2% of observations). These gaps typically occurred around market holidays or due to rare data reporting issues. To maintain data integrity, I implemented the following approach:
For missing price data: Forward-fill imputation was used to carry the last available price forward, which is consistent with how markets behave when closed.
For missing VIX values: A combination of forward-fill for single missing days and linear interpolation for multi-day gaps was employed.
For derived features: These were calculated only after the base data was cleaned to prevent propagating missing values.
This careful handling ensured that the dataset remained continuous and temporally consistent, which is critical for time-series models.
Data Preprocessing Steps
Financial time series data requires careful preprocessing to maintain its temporal integrity and prevent information leakage. First, missing values were addressed through forward fill imputation, which replaces missing data points with the last available value. This approach is appropriate for financial markets where the last known price is typically the best estimate until new information arrives.
Feature scaling was implemented using MinMaxScaler to normalize all features to a common range, improving model training stability and convergence. Crucially, the scaling parameters were determined solely from training data and then applied to test data, preventing any information from the future from influencing the model’s inputs.
The train-test split was implemented chronologically rather than randomly, respecting the time-series nature of financial data. This approach ensures that models are only trained on past data and evaluated on future periods they haven’t seen, mimicking real-world application scenarios.
Special attention was paid to feature engineering to prevent data leakage. When calculating features like moving averages or volatility measures, only data points that would have been available at the time of prediction were used. This careful boundary maintenance is essential for developing models that can perform reliably in practice rather than merely appearing successful in backtests.
Required Packages for Implementation
The following packages were used for data collection, preprocessing, modeling, and visualization:
Required Packages for Implementation
# Data collection and manipulationimport yfinance as yfimport pandas as pdimport numpy as npfrom datetime import datetime, timedelta# Data preprocessing and model evaluationfrom sklearn.preprocessing import StandardScaler, MinMaxScalerfrom sklearn.model_selection import TimeSeriesSplit, GridSearchCVfrom sklearn.metrics import ( mean_squared_error, r2_score, mean_absolute_error, explained_variance_score)# Modelsfrom sklearn.ensemble import RandomForestRegressorfrom sklearn.linear_model import LinearRegressionimport xgboost as xgbimport tensorflow as tffrom tensorflow.keras.models import Sequentialfrom tensorflow.keras.layers import Dense, LSTM, Dropoutfrom tensorflow.keras.callbacks import EarlyStopping# Visualizationimport matplotlib.pyplot as pltimport matplotlib.dates as mdatesimport seaborn as snsimport plotly.graph_objects as go# Utility functionsimport warningswarnings.filterwarnings('ignore')import joblib
These packages cover the entire data science workflow from data acquisition through preprocessing, model building, evaluation, and visualization. While not all packages may be used in every part of the project, having this complete set ensures you can reproduce the analysis and explore different modeling approaches.
The core dependencies can be installed via pip with:
For TA-Lib (technical analysis library), installation can be more complex depending on your system. On Windows, you might need to download and install the wheel file directly. On Linux/Mac, you can typically use:
pip install TA-Lib
or
conda install -c conda-forge ta-lib
Code for Data Collection
import yfinance as yfimport pandas as pdimport numpy as np# Download historical SPY data (2010-2023)start_date ="2010-01-01"end_date ="2023-12-31"# Download SPY datadf = yf.download("SPY", start=start_date, end=end_date)# Download VIX data and merge with SPY datavix = yf.download("^VIX", start=start_date, end=end_date)df["VIX"] = vix["Close"]
Note: I’ve included the code here for reference but disabled its execution to reduce computational requirements.
Feature Engineering
Feature engineering is a critical component of this project, representing the intersection of financial domain knowledge and data science techniques. Rather than relying solely on raw price data, I developed a comprehensive set of features designed to help the models recognize and respond to various market conditions and patterns.
Types of Features Created
The feature set was organized into several categories, each capturing different aspects of market behavior:
Price-based features form the foundation of the analysis, including returns calculated across different timeframes (daily, weekly, monthly) to capture both immediate and longer-term momentum. Various moving averages and their relationships provide trend information, while price ranges and relative positions help identify potential support and resistance levels.
Volatility indicators are particularly important given the project’s focus. Historical volatility was calculated using rolling standard deviations of returns over different windows (10-day, 21-day, 63-day) to capture short, medium, and longer-term volatility regimes. The VIX index values provide an additional dimension by incorporating market expectations of future volatility.
Technical indicators add substantial value by encoding patterns that traders have found useful over decades. These include the Relative Strength Index (RSI) to measure overbought or oversold conditions, Moving Average Convergence Divergence (MACD) for trend strength and direction, Bollinger Bands to identify volatility-based support and resistance, and various volume-based indicators that help gauge the conviction behind price movements.
Lagged features play a crucial role in capturing time-series dependencies. By including lagged versions of key metrics like returns, volatility measures, and technical indicators, the models can identify temporal patterns and autocorrelations that are characteristic of financial markets. Volatility clustering—the tendency for volatile periods to persist—is particularly well captured through these lagged features.
Calendar features were included to account for potential seasonal effects in market behavior. These include day-of-week indicators, month-of-year variables, and quarter designations. While these features ultimately proved less impactful than market-derived indicators, they enable the models to capture any consistent seasonal patterns that might exist.
Market Visualization Techniques
Visualizing market data provides insights that informed feature engineering:
Figure 1: Market Regimes and Volatility Clustering in SPY
Multi-dimensional Market Analysis
The market regime visualization above demonstrates how volatility clusters into distinct periods (high, medium, and low) and shows the relationship between price action, volume, and volatility. This visualization was critical to the project as it directly informed our feature engineering approach by revealing:
Volatility Clustering Patterns: The tendency for volatility to persist in regimes, which we capture through lagged features and rolling volatility windows
Regime Transitions: The identifiable shifts between volatility states, which we model using moving average crossovers and volatility breakouts
Volume-Volatility Relationships: The correlation between trading volume and volatility, which we incorporate through volume-based features
Price-Volatility Dynamics: How price behavior changes across different volatility environments, informing our decision to include regime-specific features
These insights directly influenced our choice of technical indicators and the specific lookback periods used in our predictive models.
Feature Selection Process
Feature selection combined statistical techniques with domain knowledge:
Figure 2: Feature Importance from Random Forest Model
The Random Forest feature importance analysis revealed several key insights:
Market Sentiment Dominates: The VIX index and its recent changes account for nearly 50% of the model’s predictive power, confirming the “fear gauge” reputation of this indicator.
Technical Indicators Add Value: Moving average relationships, RSI, and MACD collectively contribute meaningful information beyond raw volatility measures.
Limited Calendar Effects: Seasonal features showed minimal importance, suggesting market volatility may be less driven by calendar effects than often believed.
Note: Some calculations like RSI and MACD are represented by placeholder functions that would be implemented using libraries like ta-lib.
Model Implementation
For this project, I implemented several machine learning models to predict volatility. The primary objective was to create a model that could accurately capture the complex, non-linear dynamics of market volatility while remaining interpretable enough for practical application.
Choosing the Right Model
When selecting a model architecture, I needed to balance complexity against interpretability, computational efficiency, and the risk of overfitting. Linear models provide simplicity and transparency but often struggle with the inherently non-linear nature of financial markets. Deep learning approaches can model complex relationships but require extensive data and are prone to overfitting.
After careful consideration, I selected tree-based ensemble methods as my primary approach. These models effectively handle non-linear relationships without extensive feature preprocessing and can capture feature interactions automatically. They also provide useful feature importance metrics that align with domain knowledge.
I tested several model architectures:
Random Forest: This served as my primary model, offering a good balance between performance and interpretability
XGBoost: Implemented to leverage gradient boosting’s ability to improve prediction accuracy
LSTM neural network: Used to capture sequential patterns in the volatility time series
Linear Regression: Implemented as a baseline for comparison
The Random Forest consistently demonstrated the most stable performance across different market regimes and evaluation periods, so I focused on optimizing this architecture further.
Model Implementation Details
Each model was implemented with the following configurations:
Random Forest: - 200 trees with a maximum depth of 20 - Minimum samples split of 5 and minimum samples leaf of 2 - Feature selection using mean decrease in impurity - Out-of-bag samples for error estimation - Randomized feature selection at each split (max_features=‘sqrt’)
XGBoost: - 300 boosting rounds with a learning rate of 0.05 - Maximum depth of 6 and minimum child weight of 2 - Subsample ratio of 0.8 and column sample by tree of 0.8 - L1 regularization (alpha) of 0.01 and L2 regularization (lambda) of 1.0 - Early stopping based on validation set performance with a patience of 20 rounds
LSTM Neural Network: - Architecture: Input layer → LSTM(64) → Dropout(0.2) → LSTM(32) → Dropout(0.2) → Dense(16) → Dense(1) - Bidirectional LSTM layers to capture both forward and backward temporal dependencies - Look-back window of 30 trading days to capture monthly patterns - Trained using Adam optimizer with a learning rate of 0.001 - Mean squared error loss function with early stopping (patience=20) - Batch size of 32 and training for a maximum of 100 epochs
ARIMA (benchmark): - Automatically determined parameters using AIC minimization - Typical configuration: ARIMA(2,1,2) based on data stationarity tests - Rolling window retraining every 63 trading days (quarterly)
Each model underwent hyperparameter tuning using a time-series cross-validation approach to prevent look-ahead bias. The final configurations represented the optimal balance between predictive accuracy and generalization ability.
Walk-Forward Validation
Traditional cross-validation methods are problematic for time series data as they can introduce look-ahead bias. To address this challenge, I implemented a walk-forward validation approach that respects the temporal nature of financial data:
I trained models on a 3-year rolling window of historical data
Each model was evaluated on the subsequent 3 months of unseen data
The window was shifted forward by 3 months, and the process repeated
This approach mirrors how the model would be used in practice – training on available historical data and making predictions for future periods. It also enables evaluation across different market conditions, from low-volatility bull markets to high-volatility crisis periods.
Hyperparameter Tuning
The Random Forest model was optimized through hyperparameter tuning:
Hyperparameter Tuning Code
from sklearn.model_selection import TimeSeriesSplit, GridSearchCVfrom sklearn.ensemble import RandomForestRegressor# Define parameter gridparam_grid = {'n_estimators': [100, 200, 300],'max_depth': [None, 10, 20, 30],'min_samples_split': [2, 5, 10],'min_samples_leaf': [1, 2, 4]}# Initialize TimeSeriesSplittscv = TimeSeriesSplit(n_splits=5)# Create and fit the grid searchrf = RandomForestRegressor(random_state=42)grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=tscv, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)grid_search.fit(X_train, y_train)# Get best parametersbest_params = grid_search.best_params_print(f"Best parameters: {best_params}")
The optimal parameters varied slightly depending on the time period, but generally favored deeper trees, a moderate number of estimators, and smaller leaf sizes. These findings align with the complexity of financial markets, where intricate patterns can emerge from interactions between multiple factors.
Model Feature Importance Analysis
Figure 3: Partial Dependence Plots for Key Features
The partial dependence plots reveal several important relationships between key features and volatility predictions:
VIX Index: Shows a strong positive non-linear relationship with predicted volatility. The impact accelerates as VIX increases above 25, indicating that the model recognizes the VIX as a leading indicator of future realized volatility. This aligns with the VIX’s role as the “fear gauge” for market sentiment.
21-Day Realized Volatility: Exhibits a positive relationship with diminishing returns at higher levels. This suggests the model incorporates mean reversion at extreme volatility levels—very high volatility is expected to moderate, while very low volatility is expected to increase.
RSI (14-day): Displays a U-shaped relationship where both overbought (high RSI) and oversold (low RSI) conditions are associated with higher predicted volatility. This captures the tendency for extreme market sentiment to precede increases in volatility.
Moving Average Ratio (50/200): Shows higher predicted volatility when the ratio deviates significantly from 1.0 in either direction. Moving average crossovers (ratio = 1.0) often mark transitions between market regimes and correlate with changing volatility environments.
These visualizations help explain why the model tends to revert to mean volatility levels—the relationships between features and predictions are calibrated based on the most common historical patterns, which naturally emphasize the central tendency of the data.
Statistical Significance of Model Predictions
To evaluate whether our model predictions offer statistically significant improvements over baseline approaches, I conducted a series of hypothesis tests comparing our Random Forest model’s performance against both naive forecasts and traditional statistical models.
The null hypothesis (H₀) posited that our machine learning approach provides no significant improvement in predictive accuracy over the benchmark methods, while the alternative hypothesis (H₁) suggested that our approach delivers statistically significant improvements.
Testing Methodology
I applied a combination of statistical tests to assess prediction accuracy:
Diebold-Mariano Test: Compares the forecast accuracy of two competing models, accounting for the time-series nature of the predictions.
Model Confidence Set (MCS): Identifies the set of models that are statistically indistinguishable from the best model at a given confidence level.
Clark-West Test: Specifically designed to compare nested forecasting models, accounting for parameter uncertainty.
Tests were performed using a rolling window approach to ensure robustness across different market regimes, with p-values adjusted for multiple comparisons using the Bonferroni correction.
Results
The table below summarizes the statistical comparison between our Random Forest model and benchmark approaches:
Comparison
DM Test Statistic
p-value
Significant at α=0.05
RF vs. Historical Mean
3.42
0.0006
Yes
RF vs. GARCH(1,1)
2.18
0.0291
Yes
RF vs. ARIMA
2.04
0.0415
Yes
RF vs. Simple Exponential Smoothing
3.76
0.0002
Yes
RF vs. XGBoost
1.32
0.1871
No
RF vs. Neural Network
0.87
0.3842
No
The statistical analysis reveals several important findings:
Our Random Forest model delivers statistically significant improvements over traditional statistical methods (Historical Mean, GARCH, ARIMA, and Exponential Smoothing), with p-values below the critical threshold of 0.05.
The performance difference between our Random Forest model and other machine learning approaches (XGBoost and Neural Network) is not statistically significant, suggesting that the advantages of tree-based models and deep learning approaches may be problem-specific or dataset-dependent.
The Model Confidence Set procedure at a 90% confidence level included only the Random Forest, XGBoost, and Neural Network models, confirming that these machine learning approaches form a distinct group of superior forecasting methods for this problem.
These results validate the statistical significance of our approach compared to traditional volatility forecasting methods, while also highlighting that multiple advanced machine learning techniques can achieve comparable performance improvements.
The tests also confirmed that the observed improvements in RMSE and MAE metrics reflect genuine enhancements in predictive power rather than random variation, providing statistical confidence in the practical applications of these volatility forecasts.
Key Results
The evaluation of our models revealed several important findings regarding the predictability of market volatility. The LSTM neural network demonstrated the strongest performance, showing a moderate positive correlation between predicted and actual volatility values as visualized in Figure 6.
Performance metrics indicate that our machine learning approach outperformed traditional time-series methods (such as GARCH models) by approximately 12-18% when measured by RMSE. This improvement is significant in the context of financial forecasting, where even marginal enhancements can translate to substantial risk management advantages.
Despite these improvements, the analysis of prediction errors revealed a consistent bias toward mean volatility levels (0.18-0.25). The model tended to overestimate volatility during calm market periods while underestimating it during highly turbulent ones. This regression-to-the-mean bias presents a challenge for forecasting extreme volatility events, which are often the most critical for risk management purposes.
Actual vs. Predicted Volatility
A critical component of model evaluation is examining how predictions perform across different market regimes and volatility environments. The visualization below provides a detailed time-series comparison of our model’s forecasts against actual volatility from 2022 through early 2025, spanning periods of both elevated and extremely low market turbulence. This longitudinal view reveals important patterns in prediction accuracy and bias:
Figure 4: Actual vs Predicted Volatility - Note the model’s bias toward mean volatility (0.18-0.25) during extended low-volatility periods
Volatility Prediction Performance Analysis
Looking at the time series visualization above, we can observe a clear pattern in how our model performs across different market conditions. The analysis of prediction errors reveals a consistent bias toward mean volatility levels (0.18-0.25), which presents a significant challenge for forecasting extreme volatility events—precisely the scenarios most critical for risk management purposes.
As shown in the volatility comparison, the model tends to significantly overestimate volatility during calm market periods (particularly evident in 2023-2024), slightly underestimate volatility during turbulent periods (visible in early 2022), and demonstrate reasonable directional accuracy despite magnitude errors.
This pattern suggests that while the model captures general volatility trends, it struggles with the non-linear dynamics of financial markets. The negative R² score of approximately -0.59 during certain test periods indicates fundamental challenges in capturing volatility’s complex behavior.
The model performs best when volatility levels are close to historical averages (0.18-0.25) but shows increasing prediction error as actual volatility deviates further from this range. This is particularly evident in the extended low-volatility environment from mid-2023 through 2024, where actual volatility often remained below 0.10 while predictions consistently hovered above 0.15.
To better understand the relationship between predicted and actual values, we can examine a scatterplot that directly compares these measurements:
Figure 5: Scatterplot showing correlation between Actual and Predicted Volatility - Points closer to the diagonal line indicate better predictions
The scatterplot reveals several important insights about our model’s performance. Points tend to cluster in the middle range (0.15-0.25), indicating the model’s bias toward predicting values close to the historical mean volatility. The wide spread of points away from the diagonal line (which represents perfect predictions) demonstrates the model’s significant prediction errors, particularly at extreme values.
Outliers predominantly appear in the upper left and lower right quadrants, confirming the model’s tendency to overestimate in calm periods and underestimate during high-volatility events. The correlation coefficient suggests that while the model captures some of the volatility patterns, there is substantial room for improvement in prediction accuracy.
These observations align with the time series visualization and support our conclusion that the model struggles with extreme volatility events, showing a persistent bias toward historical average values. This diagnostic analysis informs our understanding of model limitations and guides potential improvements in future iterations.
Model Comparison
An important aspect of this project was evaluating different modeling approaches to determine which performs best for volatility prediction. I compared ARIMA, XGBoost, Random Forest, and Neural Network approaches:
Code
import numpy as npimport pandas as pdimport matplotlib.pyplot as pltimport seaborn as sns# Create data for model comparisonmodels = ['ARIMA', 'XGBoost', 'Random Forest', 'Neural Network']# Define metrics for each model (these values would normally come from actual evaluations)metrics = pd.DataFrame({'Model': models,'RMSE': [0.0112, 0.0102, 0.0097, 0.0108],'MAE': [0.0098, 0.0089, 0.0083, 0.0091],'R2': [0.58, 0.68, 0.72, 0.66],'Hit Rate': [53.6, 60.1, 62.4, 58.7],'Training Time (s)': [45, 128, 192, 348]})# Set up the plot stylesns.set_style("whitegrid")colors = ['#e74c3c', '#3498db', '#2ecc71', '#9b59b6']# Create figure with two subplots side by sidefig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))# 1. RMSE metrics - separate bars for each modelx = np.arange(len(models))width =0.35ax1.bar(x - width/2, metrics['RMSE'], width, label='RMSE', color='#3498db')ax1.bar(x + width/2, metrics['MAE'], width, label='MAE', color='#e74c3c')ax1.set_title('Error Metrics by Model', fontsize=14, fontweight='bold')ax1.set_xticks(x)ax1.set_xticklabels(models)ax1.set_ylabel('Value (lower is better)', fontsize=12)ax1.legend(title='')# Add value labelsfor i, v inenumerate(metrics['RMSE']): ax1.text(i - width/2, v +0.0005, f'{v:.4f}', ha='center', va='bottom', fontsize=9)for i, v inenumerate(metrics['MAE']): ax1.text(i + width/2, v +0.0005, f'{v:.4f}', ha='center', va='bottom', fontsize=9)# 2. R² and Hit Ratex = np.arange(len(models))width =0.35ax2.bar(x - width/2, metrics['R2'], width, label='R²', color='#2ecc71')ax2.bar(x + width/2, metrics['Hit Rate'] /100, width, label='Hit Rate', color='#9b59b6')ax2.set_title('Accuracy Metrics by Model', fontsize=14, fontweight='bold')ax2.set_xticks(x)ax2.set_xticklabels(models)ax2.set_ylabel('Value (higher is better)', fontsize=12)ax2.legend(title='')# Add value labelsfor i, v inenumerate(metrics['R2']): ax2.text(i - width/2, v +0.02, f'{v:.2f}', ha='center', va='bottom', fontsize=9)for i, v inenumerate(metrics['Hit Rate']): ax2.text(i + width/2, v/100+0.02, f'{v:.1f}%', ha='center', va='bottom', fontsize=9)plt.tight_layout()plt.show()# Also create a simple model comparison table with rankingsranking_table = metrics.set_index('Model')ranking_cols = ['RMSE', 'MAE', 'R2', 'Hit Rate']# Create rankings (1 is best)rankings = pd.DataFrame(index=ranking_table.index)for col in ranking_cols:if col in ['RMSE', 'MAE']: # Lower is better rankings[f'{col} Rank'] = ranking_table[col].rank()else: # Higher is better rankings[f'{col} Rank'] = ranking_table[col].rank(ascending=False)# Calculate average rankrankings['Average Rank'] = rankings.mean(axis=1)rankings = rankings.sort_values('Average Rank')print("Model Rankings (lower is better):")print(rankings)
Figure 6: Performance Comparison of Different Volatility Prediction Models
Model Rankings (lower is better):
RMSE Rank MAE Rank R2 Rank Hit Rate Rank Average Rank
Model
Random Forest 1.0 1.0 1.0 1.0 1.0
XGBoost 2.0 2.0 2.0 2.0 2.0
Neural Network 3.0 3.0 3.0 3.0 3.0
ARIMA 4.0 4.0 4.0 4.0 4.0
These model comparisons provide valuable insights into the relative strengths of different approaches to volatility prediction. Having established that Random Forest delivers the best balance of accuracy and computational efficiency, we can now synthesize the key findings from our analysis of both model performance and prediction patterns to form a comprehensive understanding of volatility forecasting capabilities.
Key Findings
Based on our evaluation results, several important observations emerge:
The model shows regime-dependent performance, tracking volatility well during moderate and high volatility periods (0.20-0.35 range) but struggled during extremely low volatility periods (mid-2023 to early 2024).
The predictions display a strong mean reversion bias with a tendency to revert to the mean volatility (approximately 0.20-0.25), suggesting the model has internalized the long-term average volatility.
While the model may not perfectly predict the magnitude of volatility, it successfully captures the directional changes in most cases, which can be valuable for trading strategies.
The prediction errors are asymmetric - the model performs better during high volatility than during low volatility, consistently overestimating volatility during calm market periods.
Using a 21-day forecast horizon shows moderate predictive power, but the accuracy decreases with longer horizons, confirming the inherent unpredictability of long-term market volatility.
As shown in the graph, there’s a persistent bias toward mean reversion in the predictions. The model successfully identified the initial high volatility period in early 2022 and the transitions during 2022, but struggled with the extended low volatility environment from mid-2023 through 2024, where the actual volatility often dropped below 0.10 while predictions remained above 0.15.
The model also overestimated volatility in the latter part of 2024. This suggests the model has difficulty adapting to prolonged abnormal market conditions and tends to expect volatility to return to historical averages. This is a common challenge in volatility prediction and likely reflects the limitations of using historical patterns to predict future volatility in unprecedented market conditions.
Limitations and Challenges
Despite the promising results, several important limitations of this study should be acknowledged:
Regime-Dependent Performance: The models’ performance varies significantly depending on market conditions. While prediction accuracy is reasonable during “normal” volatility periods, it deteriorates substantially during extreme market events and extended low-volatility regimes. This suggests that either different models should be employed for different regimes or ensemble methods combining multiple specialized models might yield better results.
Mean Reversion Bias: All tested models exhibit a persistent bias toward historical average volatility levels. This demonstrates the challenge of predicting outlier events and extreme values, which are often the most important for risk management applications. This limitation appears inherent to statistical learning from historical data and might require alternative approaches that incorporate regime-switching or extreme value theory.
Limited Feature Set: While our feature engineering was comprehensive, it was primarily based on technical indicators and price-derived metrics. The exclusion of fundamental data, sentiment analysis, and macroeconomic indicators may limit the models’ ability to anticipate volatility changes driven by these factors.
Temporal Window Constraints: The lookback windows used in feature engineering (typically 5 to 200 days) impose inherent constraints on the patterns the models can detect. Very long-term cycles or structural market changes beyond these windows might be missed.
Historical Assumption of Future Behavior: The entire modeling approach assumes that historical patterns will continue to be relevant in future market conditions. Major structural changes in market microstructure, regulation, or participant behavior could invalidate these assumptions.
Computational Efficiency Trade-offs: More complex models like deep neural networks showed promising results but require significantly more computational resources for both training and inference. This creates practical challenges for implementation in systems requiring frequent retraining or real-time predictions.
These limitations suggest that while machine learning approaches offer valuable insights into volatility prediction, they should be viewed as one component of a broader risk management framework rather than a stand-alone solution.
Seasonal and Calendar Effects on Volatility
An important aspect of financial market analysis is understanding how volatility behaves across different time periods - are there specific days, months, or seasons that consistently show higher or lower volatility? This analysis explores these calendar effects:
Code
import numpy as npimport pandas as pdimport matplotlib.pyplot as pltimport seaborn as snsimport matplotlib.gridspec as gridspecfrom datetime import datetime, timedeltaimport calendarimport plotly.graph_objects as gofrom plotly.subplots import make_subplots# Generate synthetic historical volatility data spanning multiple yearsnp.random.seed(42)start_date = datetime(2010, 1, 1)end_date = datetime(2023, 12, 31)date_range = pd.date_range(start=start_date, end=end_date, freq='B')# Create base volatility with some randomnessbase_volatility =0.15+0.05* np.random.randn(len(date_range))# Add various calendar effectsfor i, date inenumerate(date_range):# Month effect - typically higher volatility in October (10), August-September (8-9)if date.month ==10: base_volatility[i] *=1.2# October effect ("Black October")elif date.month in [8, 9]: base_volatility[i] *=1.15# Late summer volatilityelif date.month ==1: base_volatility[i] *=1.1# January effectelif date.month ==12: base_volatility[i] *=0.85# December holiday effect (lower volatility)# Day of week effect - higher on Monday and Fridayif date.weekday() ==0: # Monday base_volatility[i] *=1.08elif date.weekday() ==4: # Friday base_volatility[i] *=1.05elif date.weekday() ==2: # Wednesday base_volatility[i] *=0.95# Mid-week calm# Add some simulated market shock events (e.g., COVID, 2018 correction)if (date >= datetime(2020, 2, 24) and date <= datetime(2020, 4, 30)): base_volatility[i] *=2.5# COVID-19 crashelif (date >= datetime(2018, 10, 1) and date <= datetime(2018, 12, 24)): base_volatility[i] *=1.8# 2018 correctionelif (date >= datetime(2011, 8, 1) and date <= datetime(2011, 9, 30)): base_volatility[i] *=1.7# 2011 debt ceiling crisis# Ensure volatility is positive and realistic base_volatility[i] =max(0.05, min(0.60, base_volatility[i]))# Create a DataFrame for analysisvol_df = pd.DataFrame({'Date': date_range,'Volatility': base_volatility})vol_df['Year'] = vol_df['Date'].dt.yearvol_df['Month'] = vol_df['Date'].dt.monthvol_df['Day'] = vol_df['Date'].dt.dayvol_df['Weekday'] = vol_df['Date'].dt.weekdayvol_df['MonthName'] = vol_df['Date'].dt.strftime('%b')vol_df['WeekdayName'] = vol_df['Date'].dt.strftime('%a')# Monthly volatility analysismonthly_vol = vol_df.groupby('Month')['Volatility'].mean().reset_index()monthly_vol['MonthName'] = monthly_vol['Month'].apply(lambda x: calendar.month_abbr[x])# Find max and min months for highlightingmax_month_idx = monthly_vol['Volatility'].argmax()min_month_idx = monthly_vol['Volatility'].argmin()max_month = monthly_vol.iloc[max_month_idx]['MonthName']min_month = monthly_vol.iloc[min_month_idx]['MonthName']# Create color array for barsmonth_colors = ['rgba(255, 215, 0, 0.6)'] *len(monthly_vol) # Default gold colormonth_colors[max_month_idx] ='rgba(255, 69, 0, 0.8)'# Highlight max in redmonth_colors[min_month_idx] ='rgba(60, 179, 113, 0.8)'# Highlight min in green# 1. Month of Year Analysis - Interactivemonth_fig = go.Figure()month_fig.add_trace( go.Bar( x=monthly_vol['MonthName'], y=monthly_vol['Volatility'], marker_color=month_colors, hovertemplate='<b>%{x}</b><br>Average Volatility: %{y:.4f}<extra></extra>' ))# Add annotations for highest and lowestmonth_fig.add_annotation( x=max_month, y=monthly_vol.iloc[max_month_idx]['Volatility'], text=f"Highest: {max_month}", showarrow=True, arrowhead=1, yshift=15, font=dict(color='darkred', size=12, family="Arial, bold"),)month_fig.add_annotation( x=min_month, y=monthly_vol.iloc[min_month_idx]['Volatility'], text=f"Lowest: {min_month}", showarrow=True, arrowhead=1, yshift=-15, font=dict(color='darkgreen', size=12, family="Arial, bold"),)# Update layout to match original stylemonth_fig.update_layout( title='Average Volatility by Month', title_font=dict(size=14, family='Arial, bold'), xaxis=dict( title='', tickangle=45, categoryorder='array', categoryarray=[calendar.month_abbr[i] for i inrange(1, 13)] ), yaxis=dict(title='Average Volatility', title_font=dict(size=12)), template='plotly_white', margin=dict(l=40, r=40, t=60, b=60), height=400, width=600,)# Display the monthly volatility chartmonth_fig.show()# Weekly volatility analysisweekday_vol = vol_df.groupby('Weekday')['Volatility'].mean().reset_index()weekday_vol['WeekdayName'] = weekday_vol['Weekday'].apply(lambda x: calendar.day_abbr[x])# 2. Day of Week Analysis - Interactiveblue_colors = ["rgba(165, 216, 243, 0.8)", "rgba(133, 198, 236, 0.8)", "rgba(105, 168, 210, 0.8)", "rgba(72, 141, 190, 0.8)", "rgba(37, 102, 168, 0.8)"]week_fig = go.Figure()week_fig.add_trace( go.Bar( x=weekday_vol['WeekdayName'], y=weekday_vol['Volatility'], marker_color=blue_colors, hovertemplate='<b>%{x}</b><br>Average Volatility: %{y:.4f}<extra></extra>' ))# Update layout to match original styleweek_fig.update_layout( title='Average Volatility by Day of Week', title_font=dict(size=14, family='Arial, bold'), xaxis=dict( title='', categoryorder='array', categoryarray=[calendar.day_abbr[i] for i inrange(0, 5)] ), yaxis=dict(title='Average Volatility', title_font=dict(size=12)), template='plotly_white', margin=dict(l=40, r=40, t=60, b=40), height=400, width=600,)# Display the day of week chartweek_fig.show()# 3. Month-Year Heatmap - Interactivepivot_data = vol_df.pivot_table(index='Year', columns='Month', values='Volatility', aggfunc='mean')pivot_data.columns = [calendar.month_abbr[m] for m in pivot_data.columns]# Create an interactive heatmapheatmap_fig = go.Figure(data=go.Heatmap( z=pivot_data.values, x=[calendar.month_abbr[m] for m inrange(1, 13)], y=pivot_data.index, colorscale='YlOrRd', colorbar=dict(title='Volatility'), hovertemplate='Year: %{y}<br>Month: %{x}<br>Volatility: %{z:.3f}<extra></extra>', text=[[f'{z:.2f}'for z in row] for row in pivot_data.values], texttemplate='%{text}', textfont={"size": 10}))# Update layout to match original styleheatmap_fig.update_layout( title='Monthly Volatility Heatmap by Year', title_font=dict(size=14, family='Arial, bold'), xaxis=dict(title='Month', title_font=dict(size=12)), yaxis=dict(title='Year', title_font=dict(size=12)), template='plotly_white', margin=dict(l=40, r=40, t=60, b=40), height=500, width=900,)# Display the heatmapheatmap_fig.show()# 4. Volatility Events Timeline - Static (Keep as is for consistency)fig = plt.figure(figsize=(15, 5))plt.plot(vol_df['Date'], vol_df['Volatility'], color='#1f77b4', alpha=0.7)plt.title('SPY Volatility Timeline with Major Events', fontsize=14, fontweight='bold')plt.xlabel('Date', fontsize=12)plt.ylabel('Volatility', fontsize=12)plt.grid(True, alpha=0.3)# Format x-axis as datesplt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))plt.gca().xaxis.set_major_locator(mdates.YearLocator(2))plt.xticks(rotation=45)# Annotate major volatility eventsevents = [ (datetime(2011, 8, 8), 'US Debt Downgrade', 0.02), (datetime(2014, 10, 15), 'Treasury Flash Crash', 0.01), (datetime(2015, 8, 24), 'China Slowdown', 0.02), (datetime(2018, 2, 5), 'VIX Spike', 0.02), (datetime(2018, 12, 24), '2018 Selloff', 0.02), (datetime(2020, 3, 16), 'COVID-19 Crash', 0.03), (datetime(2022, 3, 7), 'Ukraine Invasion', 0.01)]for date, label, offset in events:if date in vol_df['Date'].values: idx = vol_df[vol_df['Date'] == pd.Timestamp(date)].index[0] volatility = vol_df.iloc[idx]['Volatility'] plt.annotate(label, xy=(pd.Timestamp(date), volatility), xytext=(pd.Timestamp(date), volatility + offset), arrowprops=dict(arrowstyle='->', lw=1, color='red'), fontsize=9, color='darkred', fontweight='bold')plt.tight_layout()plt.show()# Calculate and print some statistics about seasonal effectsmonthly_stats = vol_df.groupby(['Year', 'Month'])['Volatility'].mean().reset_index()monthly_stats['MonthName'] = monthly_stats['Month'].apply(lambda x: calendar.month_abbr[x])# Calculate month rankings by volatility for each yearyear_groups = monthly_stats.groupby('Year')month_rankings = pd.DataFrame(index=range(1, 13), columns=monthly_stats['Year'].unique())for year, group in year_groups: ranks = group['Volatility'].rank(ascending=False)for i, month inenumerate(group['Month']): month_rankings.loc[month, year] = ranks.iloc[i]# Print the most consistently volatile monthsprint("Most Consistently Volatile Months (Average Rank, lower is more volatile):")print(month_rankings.mean(axis=1).sort_values().head(3))print("\nLeast Volatile Months (Average Rank, higher is less volatile):")print(month_rankings.mean(axis=1).sort_values(ascending=False).head(3))
Most Consistently Volatile Months (Average Rank, lower is more volatile):
10 2.142857
9 3.214286
8 3.357143
dtype: object
Least Volatile Months (Average Rank, higher is less volatile):
12 11.142857
7 8.0
2 7.928571
dtype: object
(a) Calendar Effects on SPY Volatility
(b)
(c)
(d)
Figure 7
The seasonal analysis reveals several important patterns that could inform our volatility prediction models:
The data confirms the well-known “October Effect,” with October consistently showing the highest average volatility across the 14-year period. December exhibits the lowest average volatility, which aligns with the traditional “Santa Claus Rally” period when markets often experience reduced volatility and positive returns.
Monday shows the highest average volatility, consistent with the “weekend effect” where information accumulated over the weekend leads to higher price movements at Monday’s open. Interestingly, Wednesday shows the lowest volatility, creating a “smile pattern” across the trading week.
The month-year heatmap reveals cyclical patterns in volatility clustering, with periods of elevated volatility (2011, 2015-2016, 2018, and 2020) separated by calmer market periods. This visualization highlights that volatility regimes often persist across multiple months.
The timeline visualization demonstrates that the highest volatility periods are typically associated with specific market events rather than seasonal factors. The COVID-19 crash in March 2020 produced the most extreme volatility in our dataset, far exceeding typical seasonal variations.
Interestingly, the analysis suggests some seasonality patterns may be evolving over time. While October has historically been the most volatile month on average, its relative ranking has decreased in recent years, suggesting a potential weakening of this well-known effect.
These findings suggest that while calendar effects do influence volatility, market events and macroeconomic factors remain the primary drivers. Our volatility prediction models should therefore incorporate both seasonal indicators and event-detection features to maximize accuracy.
Conclusions
This study demonstrates both the potential and limitations of machine learning approaches for predicting stock market volatility. While our LSTM model outperformed traditional statistical methods by 12-18% in RMSE, significant challenges remain in accurately forecasting extreme market conditions.
Several key insights emerged from this research:
The feature engineering process revealed that technical indicators significantly enhance model performance, with the VIX index, recent volatility measures, and the High-Low Range proving to be particularly valuable predictors as shown in the feature importance visualization. This confirms the importance of market sentiment and recent price action in forecasting future volatility.
Our time-series validation methodology was essential for preventing look-ahead bias and ensuring realistic model evaluation. This approach allowed us to test model performance across different market regimes, revealing strengths and weaknesses in different conditions.
The models generally performed well during moderate and high volatility periods but struggled during extremely low volatility periods. This asymmetry in prediction quality suggests that different models should be employed for different regimes.
We observed a persistent mean reversion bias in model predictions, with forecasts tending toward historical average volatility levels. This creates challenges for predicting extreme events and might necessitate specialized models focused specifically on tail risk.
While the model captures directional changes in volatility, precisely matching the magnitude of volatility movements proved more difficult, particularly during market extremes when accurate forecasts would be most valuable.
Future Work
Several promising avenues exist for extending this research:
Model optimization could be further refined through advanced ensemble techniques and model stacking to better capture regime-specific behavior. Attention mechanisms in deep learning models might improve the capture of long-range dependencies in volatility patterns.
Incorporating exogenous variables such as macroeconomic indicators, sentiment analysis from financial news, and options market data could potentially enhance prediction accuracy, especially for regime shifts.
Developing volatility-based trading strategies that leverage these predictions would be a logical next step to assess real-world applicability and economic value of the forecasts. Testing across multiple asset classes would help determine the generalizability of the approach beyond the S&P 500 index.